feat(redteam): add built-in red teaming support #184
8d7d3f5 to c9f5845
poshinchen left a comment
Could you use Python's built-in | and list instead of typing's deprecated Union, List, and so on?
Quick heads-up: fixed it in 438f9e0
Created sub-issues under #177 to track P0/P1 work:
This PR (#184) covers the infrastructure layer of P0. Checked items in #220 reflect what's already implemented here:
Remaining P0 work (tracked in #220):
@poshinchen Resolved your comments in d750fe0:
/strands review the PR
Issue: This PR introduces a significant new public API surface ( The PR description documents the components well, but missing from an API review perspective:
Suggestion: Add from strands_evals.redteam import red_team and document whether this is the intended public entry point.
Review Summary
Assessment: Comment (Request Changes on specific items)
Solid foundation for red teaming capabilities. The architecture cleanly separates concerns (presets vs. strategies vs. evaluators vs. runner) and the
Review Categories
The separation of "what to attack" (presets) from "how to attack" (strategies) is a clean design that should scale well as more strategies land in the follow-up PRs.
Review Summary (Round 4)
Assessment: Request Changes
All Round 3 items were addressed well. However, a correctness issue remains with the shared target Agent state across cases.
Review Details
The first issue (agent state isolation) is the only blocker: it affects correctness of all multi-case runs. The other two are defensive improvements.
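To make the blocker concrete, here is a minimal sketch of how a shared, stateful target leaks history from one case into the next, and how clearing it per case restores isolation. FakeAgent and run_cases are hypothetical stand-ins written for this illustration; they are not the strands Agent class or the module's runner.

```python
class FakeAgent:
    """Hypothetical stateful target; its reply depends on accumulated history."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def __call__(self, prompt: str) -> str:
        self.messages.append(prompt)
        return f"seen {len(self.messages)} message(s)"


def run_cases(agent: FakeAgent, prompts: list[str], isolate: bool) -> list[str]:
    outputs = []
    for prompt in prompts:
        if isolate:
            agent.messages.clear()  # per-case isolation: start from empty history
        outputs.append(agent(prompt))
    return outputs


leaky = run_cases(FakeAgent(), ["case A", "case B"], isolate=False)
isolated = run_cases(FakeAgent(), ["case A", "case B"], isolate=True)
```

In the leaky run the second case sees the first case's turn, so its outcome depends on case ordering; in the isolated run every case starts from the same state.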
c3d82fb to 16aa4b0
Also, does the experiment return
Returns a single
Review Summary (Round 5)
Assessment: Comment (Approve with minor fixes)
All critical and important issues from Round 4 (agent state isolation, async max_workers, None guard) have been properly addressed. The module is in good shape.
Remaining Items
Neither item is blocking. The architecture is clean, test coverage is solid (7 test files covering all major components), and the layered design properly reuses existing framework primitives.
Adds an experimental red-teaming module under src/strands_evals/experimental/redteam/ that extends Strands Evals base types (Case, Experiment, Evaluator, ActorSimulator) with adversarial counterparts.
- AdversarialCaseGenerator: generates RedTeamCases per risk category, with optional auto-inference of categories from target tools/system_prompt
- RedTeamExperiment: orchestrates multi-turn attacker/target conversations
- AttackSuccessEvaluator: continuous 0.0-1.0 LLM-as-judge over conversation + tool execution traces
- AdversarialActorSimulator: ActorSimulator subclass shared across strategies
- AttackStrategy + PromptStrategy with gradual_escalation as the default
16aa4b0 to 1ecc290
Review Summary
Assessment: Approve (with minor suggestions)
The module has matured significantly through 5+ prior review rounds. All critical issues from earlier rounds (agent state isolation, concurrency safety, None guards, assert-for-validation) are resolved. The architecture cleanly composes existing framework primitives and the test coverage is thorough (762 lines of tests across 7 test files for 1198 lines of source).
Remaining Suggestions
None of these are blocking. The
jjbuck left a comment
Approved with just a few non-blocking nits noted for eventual transition from experimental to main.
poshinchen left a comment
Let's iterate on the action items in follow-up PRs.
* fix(redteam): align log format and cover dict-target path

Carry-over nits from PR #184:
- Align 8 log calls in task.py and generators/adversarial.py to the project's field=<%s> | message convention (no punctuation/capitals).
- Add unit tests for the _call_target dict-target branch (with and without a trace key), which was previously untested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(redteam): unify strategies on run_attack with strategy-agnostic cases

Every AttackStrategy now owns its multi-turn loop via an abstract run_attack(case, call_target, ...) -> AttackRunResult; the task runner injects call_target (target invocation + tool-trace capture + per-case messages.clear isolation) and no longer branches on strategy type.

Why: a single execution model (strategy owns its loop) is simpler than a runner-owned loop plus a per-strategy exception. Cases become strategy-agnostic (no strategy/template baked into RedTeamConfig); the RedTeamExperiment holds the strategy instances and expands the case x strategy cross-product at run time, so hand-crafted cases and strategy comparison (by label) are both first-class.

- base.py: run_attack @abstractmethod + AttackRunResult dataclass; add label (instance id, defaults to name); remove the unused enhance().
- PromptStrategy: relocate the ActorSimulator loop from task.py into run_attack (gradual_escalation behavior unchanged).
- RedTeamConfig: drop strategy/system_prompt_template + their validator.
- generators/adversarial: generate_cases emits strategy-agnostic cases; rename target -> agent; drop attack_strategies.
- experiment: rename target -> agent; accept attack_strategies; build _by_label (duplicate label -> ValueError); expand cross-product before delegating to the base worker (left untouched).
- task: build call_target, look up the case's strategy by label, map AttackRunResult to the {"output", "trajectory", ...} dict.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): add Crescendo multi-turn attack strategy

CrescendoStrategy escalates gradually across turns, each attacker message building on the target's previous answer. On a refusal it backtracks by simply not appending the refused (question, response) pair and retrying with a fresh question (up to max_backtracks), so the refused turn never enters the history -- a simpler equivalent of PyRIT's excluding-last-turn approach. It stops early once a turn scores at/above success_threshold.

The refusal/success/question-generation helpers are module-level functions (is_refusal, success_score, gen_escalating_question) rather than methods, so future strategies (PAIR, TAP) can reuse them without importing a strategy class. They power the strategy's cheap in-loop "should I stop?" gate; success_score reads the case's success_criteria -- the same input the authoritative AttackSuccessEvaluator uses -- so the two never disagree on what counts as success, while the evaluator remains the sole verdict over the full trace.

Parse failures degrade safely (question -> terminate preserving the conversation; judge -> score 0 and keep looping); only the evaluator raises. The attacker model resolves to the ctor model first, then the experiment model. CrescendoStrategy is exported but intentionally NOT in BUILTIN_STRATEGIES (it is user-instantiated with params).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): add per-failure drill-down to the report

The aggregate sections (top-line attack-success rate, by_risk_category, by_strategy when more than one strategy ran) are unchanged. Each failure line now also shows the attacker's objective and the strategy's per-run stats (turns used, backtracks) so a multi-turn result like Crescendo is legible at a glance, not just a single score.

The strategy's run metadata reaches the report by merging AttackRunResult.metadata onto the case metadata in the task function; the base Experiment shares that dict with the EvaluationData it builds, so no base change is needed. Full turn-by-turn conversation output is left for a future verbose mode.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(redteam): end-to-end wiring tests + fix strategy metadata join

Cover both user paths through the full locked interface with only the LLM layer mocked (attacker, in-loop judge, and the evaluator's judge agent):
- generated cases: generate_cases(agent=...) -> RedTeamExperiment with CrescendoStrategy -> run_evaluations -> RedTeamReport.
- hand-crafted cases: the same pipeline from RedTeamCase objects built by hand, skipping the generator (Model B's first-class path).

Live (real-Bedrock) runs surfaced a wiring bug these mock tests now guard: the strategy's run metadata (turns_used, backtracks) never reached the report. task_fn mutated case.metadata, but Pydantic copies that dict into a fresh EvaluationData, and the base Experiment doesn't carry task-returned metadata anyway.

Fix: the experiment now collects each case's run metadata (keyed by case name) and joins it onto the report in RedTeamReport.from_evaluation_reports -- keeping the base untouched and the collection logic on the RedTeamExperiment layer (where it stays put if the experiment later stops extending the base).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): address pre-merge review (idempotency, refusal accuracy, leaks)

Adversarial self-review before opening the PR surfaced two correctness bugs and several maintainability issues; fixing them here.

Correctness:
- Cross-product expansion was mutating self._cases in place, so re-running an experiment squared it (c0__cre -> c0__cre__cre).
  _expand_cross_product is now pure (returns a new list) and run_evaluations_async swaps/restores self._cases around the base run, making reruns idempotent.
- is_refusal flagged compliant text containing refusal substrings ("I cannot stress enough... here are the steps", "I apologize, here is..."), dropping successful attacks from the trace and biasing results toward "attack failed". Markers are now only a cheap negative prefilter; on a marker hit a refusal judge (the previously-unused REFUSAL_JUDGE_SYSTEM_PROMPT) disambiguates, with a safe "keep the turn" fallback on parse failure.

Maintainability:
- Removed the leaky AttackRunResult.trajectory field (the task owns the trace via call_target); task_fn now assembles the output/trajectory payload directly.
- Unified turns_used to "turns kept in the conversation" across strategies; Crescendo additionally reports target_calls (incl. refused, backtracked calls).
- Documented max_turns as an experiment-level ceiling (strategy runs min of the two), the no-success_criteria behavior, and the max_workers=1 requirement; run_evaluations_async now rejects max_workers != 1 instead of relying on a comment.
- Dropped the now-unused resolve_strategy/DEFAULT_STRATEGY public surface.

Tests: idempotency, refusal false-positives + judge disambiguation, all-refusal empty conversation, ctor-vs-injected max_turns both directions, no-criteria run, direct async entry + coroutine/max_workers guards; e2e now asserts exact turns_used/backtracks with an engaging (non-refusal) target.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): add report.display(verbose=True) to show failure conversations

An LLM judge can be fooled by a target that *claims* to leak -- e.g. a target that, under escalation, emits a code block it presents as "my system prompt" which may be partly hallucinated. The aggregate report can't be verified by eye without the transcript.
display(verbose=True) now prints each failed case's full attacker/target conversation (default stays the compact aggregate + one-line drill-down), so a user can confirm whether a flagged "success" is a real leak or a false positive. The conversation is carried on AttackResult.conversation (from the case's actual_output).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): strategies own their turn budget; drop experiment max_turns

The experiment's max_turns (default 10) silently capped every strategy via min(strategy_max_turns, experiment_max_turns), so CrescendoStrategy(max_turns=30) under the default experiment ran only 10 turns -- quietly breaking the compare-same-strategy-different-params use case.

Each strategy now owns its turn budget; the task passes MAX_ALLOWED_TURNS (50) as a hard ceiling, so turn_cap = min(strategy.max_turns, 50). Removed max_turns from RedTeamExperiment.__init__ entirely. Added max_turns to PromptStrategy so gradual_escalation keeps its prior default of 10 (and its {max_turns} prompt text) rather than jumping to the ceiling.
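
The turn-budget change reduces to which two numbers go into min(). A small sketch, assuming a module-level MAX_ALLOWED_TURNS constant of 50 as stated in the commit message; the helper names turn_cap and old_turn_cap are illustrative assumptions.

```python
MAX_ALLOWED_TURNS = 50  # hard ceiling passed by the task, per the commit message


def turn_cap(strategy_max_turns: int) -> int:
    """New behavior: the strategy owns its budget; the ceiling only stops runaways."""
    return min(strategy_max_turns, MAX_ALLOWED_TURNS)


def old_turn_cap(strategy_max_turns: int, experiment_max_turns: int = 10) -> int:
    """Previous behavior: the experiment default silently capped every strategy."""
    return min(strategy_max_turns, experiment_max_turns)
```

With a strategy configured for 30 turns, old_turn_cap(30) yields 10 while turn_cap(30) yields 30, which is the difference the commit describes.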

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): second adversarial pass + address reviewer bot feedback

- judges score each response statelessly (clear history per call) so earlier turns don't bias the in-loop refusal/success verdicts
- correct backtrack docstring: it is report-scope only, the target's own context is not rolled back; add a proof test
- drop dead keys from task_fn return dict (base reads only output/trajectory)
- export AttackRunResult publicly (part of the strategy extension contract)
- remove unused system_prompt_template from base AttackStrategy
- fix log-statement separators; extract dense metadata merge into locals
- add hardening cross-ref comments (lazy-init attacker, _cases swap)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): add TargetSession protocol and implementations

Introduce TargetSession (invoke/snapshot/restore/supports_rewind/trace) as the handle a strategy uses to talk to the target, replacing the opaque call_target in a follow-up. AgentTargetSession wraps a strands.Agent and is rewindable via the SDK snapshot API (deep-copy rollback); CallableTargetSession wraps an opaque callable and reports supports_rewind=False. Bumps strands-agents floor to >=1.36.0 for the snapshot API.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(redteam): replace call_target with TargetSession across strategies

Switch run_attack from call_target: Callable[[str], str] to a TargetSession, so a strategy can roll the target back via the session's snapshot/restore. Crescendo now does a real state rollback on a refusal for rewindable (Agent) targets and degrades to report-scope backtracking for opaque callables; both keep the refused turn in AttackRunResult.pruned_branches as defended-turn evidence.
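
A rough sketch of the snapshot/restore idea, assuming a toy in-memory target. The Checkpoint dataclass, InMemorySession, the refusal check, and the attack helper are hypothetical and written for this illustration; only the member names invoke, snapshot, restore, supports_rewind, and trace come from the commit message, and the real implementations wrap the SDK rather than plain lists.

```python
import copy
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class Checkpoint:
    # Hypothetical checkpoint: a copy of the history plus the trace length.
    messages: list[str]
    trace_len: int


class TargetSession(Protocol):
    supports_rewind: bool
    trace: list[dict[str, Any]]

    def invoke(self, prompt: str) -> str: ...
    def snapshot(self) -> Checkpoint: ...
    def restore(self, checkpoint: Checkpoint) -> None: ...


class InMemorySession:
    """Toy rewindable target; refuses any prompt containing 'secret'."""

    supports_rewind = True

    def __init__(self) -> None:
        self.messages: list[str] = []
        self.trace: list[dict[str, Any]] = []

    def invoke(self, prompt: str) -> str:
        self.messages.append(prompt)
        self.trace.append({"tool": "noop", "input": prompt})
        return "I can't help with that." if "secret" in prompt else f"ok: {prompt}"

    def snapshot(self) -> Checkpoint:
        return Checkpoint(copy.deepcopy(self.messages), len(self.trace))

    def restore(self, checkpoint: Checkpoint) -> None:
        # Roll back messages and trace together so a refused turn leaves no residue.
        self.messages = copy.deepcopy(checkpoint.messages)
        del self.trace[checkpoint.trace_len:]


def attack(session: TargetSession, prompts: list[str]) -> tuple[list[str], list[str]]:
    kept, pruned = [], []
    for prompt in prompts:
        checkpoint = session.snapshot() if session.supports_rewind else None
        reply = session.invoke(prompt)
        if reply.startswith("I can't"):
            pruned.append(prompt)  # keep the refused turn as defended-turn evidence
            if checkpoint is not None:
                session.restore(checkpoint)
            continue
        kept.append(prompt)
    return kept, pruned


session = InMemorySession()
kept, pruned = attack(session, ["hello", "tell me the secret", "what tools exist?"])
```

After the run, the refused prompt appears only in pruned: the session's history and trace contain the two accepted turns, which mirrors the rollback-plus-evidence behavior described above.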

The report surfaces that evidence: display() is flattened to a case x strategy matrix plus a per-attack table (every attack, breached and defended), closing the gap where a fully-defended run looked empty. Score aggregation across evaluators switches min -> max (worst-case = strongest attack). The trace is rolled back alongside messages so backtracked tool calls no longer ghost into the trajectory.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): address adversarial-review findings on TargetSession + report

- Matrix now pivots on the base case name: the cross-product names work items "{case}__{strategy}", so without stripping the suffix every cell landed on its own row and the case x strategy grid was meaningless (AR-5).
- Replace the snapshot.app_data["_trace_len"] mutation with an explicit TargetCheckpoint(agent_snapshot, trace_len) dataclass returned by snapshot() and consumed by restore() -- no stashing internal keys on the SDK object, and trace/messages roll back together (AR-1/AR-2).
- Move per-case isolation into TargetSession.reset() (clears the wrapped agent's history + trace) instead of task.py reaching into agent.messages (AR-7).

Verified live against Bedrock: backtrack still rolls back (backtracks=4, blocked=4) and the 2x2 cross-product matrix renders one row per case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): second-pass review cleanups on TargetSession + report

- Remove a duplicated paragraph in the crescendo module docstring.
- Add `from __future__ import annotations` to target_session.py to match every sibling module and keep the TargetCheckpoint forward reference safe.
- Export the session companions consistently: TargetCheckpoint joins TargetSession on the redteam facade (both are part of the strategy contract, like AttackRunResult); the two concrete impls are exported at the strategies package.
- Qualify the backtrack docstring/comment to "the target's state" -- the attacker agent keeps its own history (a known, separate quality limitation).
- Parameterize trace annotations as list[dict[str, Any]] to match the strategies layer.
- Guard the report matrix against a base-case/strategy key collision: if stripping the cross-product suffix would hide a result, fall back to full names.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): add TargetSession.trim_trace; tidy flat-table case column

- Add trim_trace(length) to the TargetSession protocol so a strategy rolls the tool trace back through the session instead of mutating session.trace directly (addresses the bot review: the protocol never promised .trace returns a mutable reference, so a defensive-copy impl would have silently ghosted refused-turn tool calls). AgentTargetSession.restore now reuses trim_trace; Crescendo's non-rewindable backtrack calls it instead of `del target_session.trace[...]`.
- Report flat table / transcript header show the base case name (the strategy column already disambiguates the cross-product), so the full "{case}__{strategy}" name no longer overflows the column into the risk field.
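
The two matrix fixes described in the commits above (pivot on the base case name, and fall back to full names on a key collision) can be sketched as below. The build_matrix function and its input shape are assumptions made for this illustration; only the "{case}__{strategy}" naming and the fallback rule come from the commit messages.

```python
def build_matrix(scores: dict[tuple[str, str], float]) -> dict[str, dict[str, float]]:
    """Pivot (work_item_name, strategy_label) -> score into rows keyed by base case."""
    matrix: dict[str, dict[str, float]] = {}
    for (name, strategy), score in scores.items():
        suffix = f"__{strategy}"
        # Strip the cross-product suffix so all strategies share one row per case.
        base = name[: -len(suffix)] if name.endswith(suffix) else name
        row = matrix.setdefault(base, {})
        if strategy in row:
            # Stripping would hide a result: fall back to rows keyed by full names.
            fallback: dict[str, dict[str, float]] = {}
            for (full_name, label), value in scores.items():
                fallback.setdefault(full_name, {})[label] = value
            return fallback
        row[strategy] = score
    return matrix


grid = build_matrix(
    {
        ("c0__cre", "cre"): 0.9,
        ("c0__gradual", "gradual"): 0.1,
        ("c1__cre", "cre"): 0.0,
        ("c1__gradual", "gradual"): 0.4,
    }
)
collided = build_matrix({("a__x", "x"): 1.0, ("a", "x"): 0.0})
```

In the first call each case gets one row with a cell per strategy; in the second, a hand-named case "a" would collide with the stripped "a__x", so the full names are kept and neither result is dropped.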

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(redteam): drop callable target, harden TargetSession contract

- remove CallableTargetSession; input is Agent | TargetSession (TypeError otherwise)
- rename AgentTargetSession -> StrandsAgentSession
- trace: property -> plain attribute; fold trim into restore()
- split invoke into _send + _tool_uses_in; add ToolUseEntry TypedDict
- crescendo: never backtrack a tool-call turn; stop on it (keep breach evidence)
- resilient trace extraction (placeholder on malformed block keeps the gate honest)

* test(redteam): use a real TargetSession in experiment wiring tests

The lambda agents in test_experiment.py hit the new _build_session TypeError and passed only because the base experiment catches it as score=0 -- so the default-task and cross-product wiring was never actually exercised (run_attack was unreachable). Swap the lambdas for a _FakeSession so the intended paths run.

* fix(redteam): reset target to clean baseline, not just messages

StrandsAgentSession.reset() only cleared messages, but snapshot()/restore() round-trip the full session preset (messages, state, conversation_manager_state, interrupt_state). So agent state leaked across cases -- a tool writing agent.state in case N would still be set in case N+1, which can flip a later attack's outcome.

The experiment now captures one clean baseline at task-build time (before the first case, while the shared agent is still as-constructed) and reset() rolls back through the same load_snapshot path restore() uses. Seeded target history is preserved (it's part of the target definition); per-case state is cleared.
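
The capture-once, replay-per-case idea can be illustrated with a toy agent. ToyAgent and BaselineSession are hypothetical stand-ins for this sketch and use copy.deepcopy where the real session goes through the SDK's snapshot loading; the point shown is only that resetting to a baseline clears per-case state while keeping seeded history.

```python
import copy
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyAgent:
    """Hypothetical stand-in for a stateful agent; not the strands Agent."""

    messages: list[str] = field(default_factory=list)
    state: dict[str, Any] = field(default_factory=dict)


class BaselineSession:
    def __init__(self, agent: ToyAgent) -> None:
        self.agent = agent
        # Capture once, while the shared agent is still as-constructed.
        self._baseline = copy.deepcopy((agent.messages, agent.state))

    def reset(self) -> None:
        # Copy on every replay so later mutations never alias the baseline.
        messages, state = copy.deepcopy(self._baseline)
        self.agent.messages, self.agent.state = messages, state


agent = ToyAgent(messages=["seeded system context"])
session = BaselineSession(agent)

# Case N: a tool writes agent state and the conversation grows.
agent.messages.append("attack turn")
agent.state["is_admin"] = True

session.reset()  # case N+1 starts from the clean baseline
```

Clearing only agent.messages would have left is_admin set for the next case; restoring the whole baseline removes it while the seeded message survives.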

* test(redteam): pin baseline-reset invariants; tighten _build_session typing

Follow-up to the reset fix after an adversarial review pass:
- type _build_session(baseline) as Snapshot | None instead of Any (it feeds load_snapshot, so a non-Snapshot would only surface as a swallowed per-case error)
- add a real-Agent test that one baseline survives repeated resets uncorrupted (the capture-once/replay-N aliasing risk), and a test locking the documented limitation that a no-baseline session does not isolate non-message state

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
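
The is_refusal fix described in the commit message above (markers as a cheap negative prefilter, a judge to disambiguate on a marker hit, and a "keep the turn" fallback) might look roughly like this. The marker list, the verdict strings, and keyword_judge are assumptions for the sketch; in the module the judge is an LLM call, not a keyword check.

```python
from typing import Callable

# Illustrative marker list; the module's actual markers are not shown in the PR text.
REFUSAL_MARKERS = ("i cannot", "i can't", "i apologize", "i'm sorry")


def is_refusal(response: str, judge: Callable[[str], str]) -> bool:
    text = response.lower()
    # Cheap negative prefilter: no marker means "not a refusal" without a judge call.
    if not any(marker in text for marker in REFUSAL_MARKERS):
        return False
    verdict = judge(response).strip().upper()
    if verdict == "REFUSAL":
        return True
    # "COMPLIANT" or an unparseable verdict: keep the turn (safe fallback).
    return False


def keyword_judge(response: str) -> str:
    """Stand-in for an LLM refusal judge, used only to make the sketch runnable."""
    return "COMPLIANT" if "here are the steps" in response.lower() else "REFUSAL"
```

With this shape, "I cannot stress enough... here are the steps" is kept as a compliant turn, while a plain "I cannot help with that." is still treated as a refusal.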
Description
Adds an experimental red-teaming module (strands_evals.experimental.redteam) that lets users run multi-turn adversarial attacks against a target agent and score whether safety guardrails hold. The module composes existing Strands Evals primitives (Case, Experiment, Evaluator, ActorSimulator) rather than introducing a parallel framework.

Two-step flow

1. Generate cases. AdversarialCaseGenerator infers risk categories from the target's system prompt and tools, then generates per-category attack cases via an LLM. Custom cases can be authored directly via RedTeamCase + AttackGoal for domain-specific business rules.
2. Run evaluations. RedTeamExperiment drives a multi-turn attacker conversation against the target, captures the full conversation + tool trace, and scores it with an LLM judge.

What ships
doc §2.4).
Related Issues
Closes #220.
Type of Change
New feature (experimental module).
Testing
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.